fix(guard): handle actor stopped during ws/req #3481

MasterPtato · 2025-11-18T23:43:51Z

No description provided.

vercel · 2025-11-18T23:43:56Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivet-site	Ready	Preview	Comment	Nov 20, 2025 8:55pm
rivetkit-serverless	Ready	Preview	Comment	Nov 20, 2025 8:55pm

2 Skipped Deployments

Project	Deployment	Preview	Comments	Updated (UTC)
rivet-cloud	Ignored			Nov 20, 2025 8:55pm
rivet-inspector	Ignored	Preview		Nov 20, 2025 8:55pm

MasterPtato · 2025-11-18T23:44:07Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

chore(rivetkit): implement new hibernating ws protocol #3464: 2 dependent PRs (#3452, #3487)
fix(actor-kv): fix listing keys #3484
fix(guard): handle actor stopped during ws/req #3481 👈 (View in Graphite)
chore: rename retry -> hibernate in runner protocol #3473
fix(udb): update backoff algo #3469
chore: write/update docs #3468
chore: add debug scripts #3467
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-18T23:47:30Z

Pull Request Review

Summary

This PR fixes a race condition where actor stops during WebSocket connections or HTTP requests weren't being properly handled, leading to requests hanging indefinitely. The fix adds subscriptions to the actor::Stopped workflow event and properly terminates pending operations when actors stop.

✅ Positive Aspects

Critical Bug Fix: Addresses a real race condition where requests/WebSockets could hang if an actor stops during operation
Consistent Error Handling: Uses appropriate error types (ServiceUnavailable, WebSocketServiceUnavailable) that match existing patterns
Hibernation Awareness: Correctly handles the hibernation case for WebSockets - hibernating if allowed, otherwise returning unavailable
Proper Logging: Good use of structured logging with tracing::debug! to track when actors stop during operations

🔍 Code Quality Observations

1. Duplicate Subscription Pattern

In lib.rs, the stopped_sub subscription is created in three places. Consider creating the subscription once at the beginning of handle_request and handle_websocket methods and reusing it throughout the function to reduce redundant subscription overhead.

2. Runner WebSocket Behavior Change (Important)

In pegboard-runner/src/lib.rs:246-252, the behavior was changed so WebSockets are NOT notified when the runner WS closes. This means WebSocket connections will remain open from the gateway perspective even when the runner disconnects, waiting only for the actor to stop. Recommendation: Add integration tests to verify this behavior works correctly.

3. Comment Accuracy

Line 385 comment changed from "Send reclaimed messages" to "Send pending messages" - good improvement in clarity! ✅

🐛 Potential Issues

1. Subscription Timing in Hibernation Flow

The stopped_sub is created at line 290 for WebSocket open handling, but if after_hibernation is true (line 310), the WebSocket open logic is skipped. Verify that the handle_websocket_hibernation_inner method subscribes to the Stopped event, or document why this race is acceptable.

2. Error Propagation During Cleanup

When sending the WebSocket close message fails, only an error log is produced. Should this error be propagated? If the close message fails to send, the runner might not clean up properly.

🔒 Security Considerations

No security issues identified. The changes properly handle error cases and don't introduce new attack vectors. ✅

⚡ Performance Considerations

Subscription Overhead: Multiple subscriptions to the same event stream could add overhead. Consider optimizing by reusing subscriptions.
HashMap Lookups: The has_pending_websocket_messages method performs a HashMap lookup on the hot path during hibernation.

🧪 Test Coverage

Major Gap: No tests are included with this PR for critical race condition fixes.

Recommendations:

Unit tests: Mock the stopped_sub to fire during various stages
Integration tests: Simulate actor stopping during active request/WebSocket
Edge cases: Actor stops immediately after subscription, multiple concurrent requests

🎯 Recommendations Summary

Must Address:

Verify hibernation subscription coverage for stop events
Add tests for the actor stop scenarios

Should Consider:

Optimize subscription creation (reuse where possible)
Clarify intended behavior for WebSocket when runner disconnects but actor continues
Add integration tests for the new behavior

Final Verdict

This is a solid bug fix that addresses a real race condition. The implementation is clean and follows existing patterns. However, the lack of tests for such critical race condition fixes is concerning.

Recommended Action: Approve with the requirement that tests be added before merging, especially for the WebSocket-specific behavior changes.

Great work on identifying and fixing this race condition! 🚀

claude · 2025-11-18T23:52:16Z

PR Review: fix(guard): handle actor stopped during ws/req

Summary

This PR adds handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway, ensuring graceful cleanup when actors stop mid-request. The changes also improve WebSocket lifecycle management in the runner.

Code Quality ✅

Strengths:

Clear separation of concerns with stopped_sub subscription in both request and WebSocket handlers
Consistent error handling using appropriate error types (ServiceUnavailable, WebSocketServiceUnavailable, WebSocketServiceHibernate)
Follows existing patterns and conventions in the codebase
Good use of tokio::select! for concurrent event handling

Detailed Analysis

1. pegboard-gateway/src/lib.rs - Actor Stopped Handling

Lines 155-158, 290-293: Subscription creation

let mut stopped_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Stopped>(("actor_id", self.actor_id))
    .await?;

✅ Good: Subscribes to actor stopped events before processing requests
⚠️ Consideration: The subscription happens after reading the request body. If the actor stops during body reading, it won't be detected. Consider moving the subscription earlier if this is a concern.

Lines 220-223, 356-359, 436-444: Select arms for stopped events

_ = stopped_sub.next() => {
    tracing::debug!("actor stopped while waiting for request response");
    return Err(ServiceUnavailable.build());
}

✅ Good: Appropriate error types returned for each context
✅ Good: Debug logging helps with troubleshooting
✅ Good: Hibernation-aware logic in WebSocket handler (lines 436-444)
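The mapping from connection kind to error described above can be sketched as a small pure function. This is only an illustration: `StopError`, `Connection`, and `error_for_actor_stopped` are hypothetical stand-ins, not the gateway's actual types, and the variant names simply mirror the error names mentioned in the review.

```rust
/// Hypothetical stand-ins for the error kinds named in the review.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum StopError {
    ServiceUnavailable,
    WebSocketServiceUnavailable,
    WebSocketServiceHibernate,
}

/// Hypothetical description of what was in flight when the actor stopped.
#[derive(Debug, Clone, Copy)]
pub enum Connection {
    Http,
    WebSocket { can_hibernate: bool },
}

/// Pick the error to return once the actor's Stopped event is observed.
pub fn error_for_actor_stopped(conn: Connection) -> StopError {
    match conn {
        // HTTP requests are ephemeral: fail them right away.
        Connection::Http => StopError::ServiceUnavailable,
        // A hibernatable WebSocket goes to sleep instead of closing.
        Connection::WebSocket { can_hibernate: true } => StopError::WebSocketServiceHibernate,
        // Otherwise the WebSocket is reported as unavailable.
        Connection::WebSocket { can_hibernate: false } => StopError::WebSocketServiceUnavailable,
    }
}
```

Keeping the decision in one place like this would also make it straightforward to unit test without a running actor.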

2. pegboard-gateway/src/shared_state.rs - Pending Messages Check

Lines 347-357: New has_pending_websocket_messages method

pub async fn has_pending_websocket_messages(&self, request_id: RequestId) -> Result<bool> {
    let Some(req) = self.in_flight_requests.get_async(&request_id).await else {
        bail!("request not in flight");
    };

    if let Some(hs) = &req.hibernation_state {
        Ok(!hs.pending_ws_msgs.is_empty())
    } else {
        Ok(false)
    }
}

✅ Good: Simple, focused method
✅ Good: Proper error handling for missing requests
✅ Good: Safe handling of optional hibernation state

Lines 609-616: Early wake optimization in handle_websocket_hibernation

// Immediately rewake if we have pending messages
if self
    .shared_state
    .has_pending_websocket_messages(unique_request_id.into_bytes())
    .await?
{
    return Ok(HibernationResult::Continue);
}

✅ Excellent: Prevents unnecessary hibernation when messages are pending
✅ Performance: Avoids spawning keepalive task unnecessarily

3. pegboard-runner/src/lib.rs - WebSocket Lifecycle Changes

Lines 246-279: Simplified close message handling

// Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
    // Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
    // instead they wait for the actor itself to stop.
    if req.is_ws {
        continue;
    }
    // ... send ToServerResponseAbort
}

✅ Good: Clear distinction between WebSocket and HTTP request lifecycle
✅ Good: Comprehensive comment explains the design decision
✅ Good: Removes complex WebSocket close logic that was previously trying to send close frames
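To make the HTTP versus WebSocket distinction concrete, here is a minimal std-only sketch of the filtering step. The `RequestId` alias, the `ActiveRequest` struct, and `requests_to_abort` are assumptions made for illustration; the real runner holds the map behind an async lock and sends ToServerResponseAbort for each selected request.

```rust
use std::collections::HashMap;

/// Hypothetical request identifier; the real type is not shown in the review.
pub type RequestId = [u8; 4];

/// Hypothetical subset of the per-request state tracked by the runner.
pub struct ActiveRequest {
    pub is_ws: bool,
}

/// Return the ids that should receive an abort when the runner WS closes.
/// WebSockets are skipped: they wait for the actor itself to stop.
pub fn requests_to_abort(active: &HashMap<RequestId, ActiveRequest>) -> Vec<RequestId> {
    let mut ids: Vec<RequestId> = active
        .iter()
        .filter(|(_, req)| !req.is_ws)
        .map(|(id, _)| *id)
        .collect();
    // Sort so the result does not depend on HashMap iteration order.
    ids.sort();
    ids
}
```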

Removed Code Analysis:

Removed lines 13, 246-262: Eliminated CloseFrame and CloseCode imports and complex close logic
✅ Simplification: The new approach is cleaner - WebSockets wait for actor stop signal rather than runner disconnect

Potential Issues & Suggestions

Minor Issues

Comment Wording (line 385):

// Send pending messages

Previous comment said "Send reclaimed messages" - good fix! ✅

Subscription Timing (lines 155-158):
The stopped_sub is created after reading the request body. If the actor stops during body reading (lines 148-153), it won't be detected immediately. Consider:

// Subscribe before reading body
let mut stopped_sub = self.ctx.subscribe::<...>(...).await?;

let body_bytes = req.into_body().collect().await
    .context("failed to read body")?
    .to_bytes();

Error Propagation Consistency:
All three stopped event handlers return errors appropriately, but consider if there are cases where you'd want to distinguish between "actor stopped" vs other ServiceUnavailable causes for metrics/logging.

Questions for Discussion

Race Condition: Is there a race between the actor stopping and the subscription being established? If an actor stops between handler start and subscription creation, the event might be missed.
Subscription Cleanup: The stopped_sub is dropped at the end of each function. Is this the desired behavior, or should subscriptions be maintained longer?
Testing: Are there integration tests that verify:
- Actor stops during request body reading?
- Actor stops while waiting for response?
- Actor stops during WebSocket message handling?
- Actor stops during hibernation?

Performance Considerations

✅ Good: The early wake check (lines 609-616) prevents unnecessary hibernation overhead
✅ Good: Using tokio::select! for concurrent event handling is efficient
⚠️ Minor: Creating a new subscription for each request/WebSocket connection could add overhead at scale. Consider connection pooling or subscription reuse if this becomes a bottleneck.

Security Concerns

✅ No security issues identified

Proper error handling prevents information leakage
No authentication/authorization changes
No unsafe code introduced

Test Coverage

⚠️ Missing Test Evidence: No test files were modified in this PR. Consider adding tests for:

Actor stopping during request handling
Actor stopping during WebSocket connection
Actor stopping during hibernation
Early wake when pending messages exist

Recommendations

Add Integration Tests: Cover the new actor stopped scenarios
Consider Subscription Timing: Move subscription creation before body reading if actor stop during body read is a concern
Add Metrics: Consider adding metrics to track how often actors stop mid-request for operational visibility
Documentation: Add a code comment explaining the actor stopped subscription pattern for future maintainers

Verdict

Approval: ✅ Approved with Suggestions

This is a solid PR that improves reliability by handling actor stopped events gracefully. The code follows existing patterns, has appropriate error handling, and includes a nice optimization for hibernation. The suggestions above are minor improvements that could be addressed in follow-up work.

Strengths:

Proper error handling
Good logging
Performance optimization (early wake)
Simplified WebSocket lifecycle

Action Items:

Consider adding integration tests
Consider moving subscription earlier to catch stops during body reading
Consider adding metrics for operational visibility

claude · 2025-11-18T23:52:59Z

Code Review: PR #3481 - fix(guard): handle actor stopped during ws/req

Summary

This PR adds proper handling for actor stop events during WebSocket and HTTP request processing in the pegboard gateway. The changes ensure graceful cleanup when actors stop while requests are in-flight.

Code Quality ✅

Strengths:

Consistent error handling: Properly returns ServiceUnavailable for HTTP requests and WebSocketServiceUnavailable/WebSocketServiceHibernate for WebSocket connections based on hibernation capability
Structured logging: Uses structured logging correctly with tracing::debug!(?x, "message") pattern, following CLAUDE.md conventions
Lifecycle management: The optimization in handle_websocket_hibernation (lib.rs:609-616) to immediately rewake when pending messages exist is a good performance improvement
Code organization: Subscribes to the Stopped signal early in both handle_request and handle_websocket methods, ensuring proper cleanup

Potential Issues ⚠️

1. Subscription Created After Request Start (Medium Priority)

In handle_request (lib.rs:155-158), the stopped_sub subscription is created after reading the request body but before starting the in-flight request. If the actor stops between reading the body and subscribing, there's a small race condition window.

Recommendation: Move the subscription to the earliest point possible.

2. WebSocket: No Stopped Handling During Hibernation (Medium Priority)

In handle_websocket_hibernation_inner (lib.rs:668-699), the method only listens for Ready events and WebSocket messages, but doesn't handle the case where the actor stops during hibernation.

Recommendation: Add a stopped_sub to the hibernation select to handle actor stops during hibernation.

3. Runner-side: Changed Semantics (Low Priority)

In pegboard-runner/src/lib.rs:249-253, WebSockets no longer receive close messages when the runner WebSocket closes. Consider documenting this behavior more clearly.

Performance Considerations ✅

Positive:

Early check for pending messages in hibernation avoids unnecessary keepalive task spawn
Subscriptions are cleaned up properly via tokio::select patterns

Security Concerns ✅

No security issues identified. Error messages don't leak sensitive information.

Test Coverage ⚠️

Missing:

No test files found in pegboard-gateway package
This PR adds critical error handling logic that should be tested

Recommendations:

Add integration tests for actor stop scenarios
Test the race condition scenarios
Verify that pending messages are properly handled when actor stops

Overall Assessment

This is a solid improvement that addresses a real gap in actor lifecycle handling. The code quality is good and follows repository conventions. The main concern is the lack of test coverage and a couple of edge cases in the subscription timing.

Approval Status: ⚠️ Request Changes (add tests + fix race conditions)

cc: @MasterPtato

🤖 Generated with Claude Code

claude · 2025-11-19T01:30:48Z

PR Review: fix(guard): handle actor stopped during ws/req

Overview

This PR addresses an important issue where the gateway doesn't properly handle actor stop events during WebSocket and HTTP request processing. The implementation subscribes to actor Stopped events and gracefully terminates connections when actors are no longer available.

Code Quality & Best Practices

✅ Strengths

Consistent Error Handling: The PR correctly uses appropriate error types:
- ServiceUnavailable for HTTP requests
- WebSocketServiceUnavailable for WebSocket connections (when hibernation not enabled)
- WebSocketServiceHibernate for WebSocket connections (when hibernation enabled)
Follows Repository Patterns:
- Uses structured logging correctly (e.g., tracing::debug!("actor stopped...") without formatting into the message)
- Adheres to the error handling patterns defined in the codebase
- Properly uses tokio::select! for concurrent event handling
Good Code Organization: The changes are well-scoped and localized to the affected modules

🔍 Potential Issues

1. Logic Error in `is_ws_hibernate` check (Critical)

Location: engine/packages/pegboard-gateway/src/lib.rs:562-566

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Issue: The logic appears inverted. On a Result, map_or_else takes two closures:

First closure runs on Err (the error case)
Second closure runs on Ok (the success case)

This means:

If lifecycle_res is Err(...), it checks is_ws_hibernate(err) ✓
If lifecycle_res is Ok(...), it returns false ✓

But the if body, which sends the close frame, runs only when the condition is true, which means:

Close frames are sent only for hibernation errors
Close frames are NOT sent for successful completions or non-hibernation errors

Expected behavior: Close frames should NOT be sent when hibernating, but SHOULD be sent otherwise.

Suggested fix:

// Send close frame to runner if NOT hibernating
if !lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Or more clearly:

let should_send_close = match &lifecycle_res {
    Err(err) => !is_ws_hibernate(err),
    Ok(_) => true,
};

if should_send_close {
    // ... send close message
}
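The difference between the two conditions can be checked in isolation. The sketch below uses a hypothetical `LifecycleError` enum and a stand-in `is_ws_hibernate` (the real function inspects the gateway's error type, which is not shown here), so treat it as a truth-table illustration rather than the actual code.

```rust
/// Hypothetical error type; only the hibernate/other split matters here.
#[derive(Debug)]
pub enum LifecycleError {
    Hibernate,
    Other,
}

/// Stand-in for the real `is_ws_hibernate`, assumed to take a reference to the error.
pub fn is_ws_hibernate(err: &LifecycleError) -> bool {
    matches!(err, LifecycleError::Hibernate)
}

/// The condition as quoted in the review: true only for a hibernate error.
pub fn original_condition(res: &Result<(), LifecycleError>) -> bool {
    // On a Result, the first closure handles Err and the second handles Ok.
    res.as_ref().map_or_else(is_ws_hibernate, |_| false)
}

/// The suggested replacement: send the close frame unless hibernating.
pub fn should_send_close(res: &Result<(), LifecycleError>) -> bool {
    match res {
        Err(err) => !is_ws_hibernate(err),
        Ok(_) => true,
    }
}
```

For every input the two functions return opposite values, which is why negating the original condition and the explicit match are interchangeable fixes.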

2. Subscription Timing

Location: Multiple locations in lib.rs

The PR creates stopped_sub subscriptions after reading the request body (for HTTP) or after header extraction (for WebSocket). This creates a potential race condition:

let body_bytes = req.into_body().collect().await?; // Actor could stop here
let mut stopped_sub = self.ctx.subscribe::<...>().await?; // We'd miss it

Impact: If the actor stops between request parsing and subscription, the stop event would be missed and the request would timeout instead of failing immediately.

Suggested improvement: Subscribe to actor stopped events as early as possible, before any async operations.
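The ordering concern can be modeled with a toy in-process bus. This sketch assumes that events reach only subscribers registered at publish time and are not replayed; whether the real workflow subscription behaves that way is not established in this thread. `StopBus`, `subscribe_late`, and `subscribe_early` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical bus: an event is delivered only to subscribers that
/// already exist when it is published (no replay).
#[derive(Default)]
pub struct StopBus {
    subscribers: Vec<Sender<()>>,
}

impl StopBus {
    pub fn subscribe(&mut self) -> Receiver<()> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    pub fn publish_stopped(&mut self) {
        // Deliver to current subscribers, dropping any whose receiver is gone.
        self.subscribers.retain(|tx| tx.send(()).is_ok());
    }
}

/// Actor stops before we subscribe (for example while the body is read).
pub fn subscribe_late(bus: &mut StopBus) -> bool {
    bus.publish_stopped();
    let sub = bus.subscribe();
    sub.try_recv().is_ok()
}

/// Subscription is in place before any other work happens.
pub fn subscribe_early(bus: &mut StopBus) -> bool {
    let sub = bus.subscribe();
    bus.publish_stopped();
    sub.try_recv().is_ok()
}
```

Under that no-replay assumption the late subscriber never sees the stop, which corresponds to the request waiting for its timeout instead of failing immediately.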

3. Immediate Wake on Pending Messages

Location: engine/packages/pegboard-gateway/src/lib.rs:614-621

if self
    .shared_state
    .has_pending_websocket_messages(unique_request_id.into_bytes())
    .await?
{
    return Ok(HibernationResult::Continue);
}

Observation: This is good defensive programming that prevents unnecessary hibernation when messages are already pending. However, there's no explanation in comments about why this is needed.

Suggestion: Add a comment explaining this check prevents a race condition where messages arrive just as hibernation begins.

4. Missing Stopped Subscription in Hibernation Handler

Location: engine/packages/pegboard-gateway/src/lib.rs:673-705

The handle_websocket_hibernation_inner function subscribes to actor::Ready events but NOT to actor::Stopped events. This means if an actor stops during hibernation, the WebSocket will remain hibernated until either:

The actor becomes ready again (unlikely if stopped)
The client sends a message or closes

Suggested improvement: Also subscribe to Stopped events during hibernation and close the connection if received.
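A simplified, synchronous model of that suggestion is sketched below. The real handler would presumably add another arm to its tokio::select! over separate subscriptions; here a single std channel carries a hypothetical `HibernationEvent`, and `HibernationOutcome::Close` is an assumed name (only a `Continue` result appears in the quoted code).

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical events a hibernating WebSocket might wait on.
pub enum HibernationEvent {
    ActorReady,
    ActorStopped,
    ClientClosed,
}

/// Hypothetical outcome of the wait.
#[derive(Debug, PartialEq, Eq)]
pub enum HibernationOutcome {
    Continue,
    Close,
}

/// Block until the first event arrives and decide what to do with the socket.
pub fn wait_during_hibernation(events: &Receiver<HibernationEvent>) -> HibernationOutcome {
    match events.recv() {
        // The actor is back: resume proxying.
        Ok(HibernationEvent::ActorReady) => HibernationOutcome::Continue,
        // The actor stopped while we slept: do not stay hibernated.
        Ok(HibernationEvent::ActorStopped) => HibernationOutcome::Close,
        // The client went away on its own.
        Ok(HibernationEvent::ClientClosed) => HibernationOutcome::Close,
        // All senders dropped: nothing can wake us, so close.
        Err(_) => HibernationOutcome::Close,
    }
}
```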

Performance Considerations

✅ Good Practices

Efficient Subscription Management: The PR properly uses tokio::select! which is efficient for handling multiple async events concurrently.
No Blocking Operations: All operations are properly async, preventing thread blocking.

⚠️ Minor Concerns

Multiple Subscriptions: Each request/WebSocket creates a new subscription to actor stopped events. For actors with high request volume, this could create significant subscription overhead.
- Impact: Likely minimal in practice since subscriptions are lightweight
- Consideration: Monitor UPS subscription count if this becomes a bottleneck

Security Concerns

✅ No Major Issues

Error messages don't leak sensitive information
Proper error propagation without exposing internal state
No authentication/authorization bypasses introduced

Test Coverage

⚠️ Missing Tests

The PR doesn't include any tests for the new actor stopped handling. Consider adding integration tests for:

HTTP Request: Actor stops while waiting for response
WebSocket Open: Actor stops during WebSocket handshake
WebSocket Active: Actor stops during active WebSocket connection
WebSocket Hibernation: Actor stops while WebSocket is hibernated
Hibernation with Pending Messages: Verify immediate wake behavior

Example test scenario:

#[tokio::test]
async fn test_actor_stopped_during_request() {
    // Setup: Create actor and start request
    // Action: Stop actor while request is in flight
    // Assert: Request returns ServiceUnavailable error
}

Runner Changes

✅ Improved Clarity

Location: engine/packages/pegboard-runner/src/lib.rs:246-279

The change to NOT send WebSocket close messages when the runner connection closes is correct:

// Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
// instead they wait for the actor itself to stop.
if req.is_ws {
    continue;
}

This properly distinguishes between:

HTTP requests: Ephemeral, should be aborted when runner disconnects
WebSockets: Persistent, rely on actor lifecycle events

The comment clearly explains this distinction. Good!

Documentation

Website Changes

The changes to website/public/llms*.txt appear to be unrelated documentation cleanup. This is fine but might warrant a separate commit for clarity.

Overall Assessment

Status: ⚠️ Needs Changes

The PR addresses a real issue and the overall approach is sound, but there's a critical logic error in the is_ws_hibernate check that needs to be fixed before merging. Additionally, the race condition with subscription timing and missing test coverage should be addressed.

Required Changes:

✅ Fix the inverted logic in the is_ws_hibernate check (line 562-566)

Recommended Changes:

Subscribe to actor stopped events earlier to prevent race conditions
Add stopped event handling during hibernation
Add test coverage for the new functionality
Add explanatory comments for the immediate wake optimization

Nice to Have:

Consider separating the documentation cleanup into a separate commit
Add performance monitoring for subscription overhead if deployed at scale

Great work identifying and fixing this issue! The core logic is solid, just needs the critical bug fix and some defensive improvements.

claude · 2025-11-20T22:38:27Z

Code Review

Summary

This PR implements proper handling of actor stops during WebSocket and HTTP requests in the pegboard gateway. The changes ensure that when an actor stops, in-flight requests and WebSocket connections are gracefully terminated instead of hanging.

Positive Aspects

✅ Core Fix is Sound: The addition of stopped_sub subscriptions to detect actor stop events is the correct approach. This properly handles the edge case where an actor stops while requests are being processed.

✅ Consistent Error Handling: The use of ServiceUnavailable for HTTP requests and conditional error handling for WebSockets (hibernate-capable vs non-hibernate) is appropriate.

✅ Optimization: The early check in handle_websocket_hibernation (lines 614-621) to immediately wake if there are pending messages is a good performance optimization.

✅ Proper Visibility Change: Making is_ws_hibernate public allows proper detection of hibernation errors when deciding whether to send close frames.

✅ Lifecycle Cleanup: The runner cleanup logic correctly skips WebSocket close messages when the runner itself disconnects, as WebSockets wait for actor stop events.

claude · 2025-11-20T22:38:35Z

Issues and Concerns

🔴 Critical: Logic Error in Close Frame Condition

Lines 562-566 in pegboard-gateway/src/lib.rs have an inverted condition:

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Problem: map_or_else evaluates the first closure (checking hibernation) when the result is Err, and returns false for Ok values. This means:

Close frames are sent when hibernating (should NOT send)
Close frames are NOT sent when not hibernating (should send)

Expected: Close frames should be sent to the runner UNLESS we're hibernating.

Fix Option 1 - Negate the condition:

if !lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false) {

Fix Option 2 - Use clearer logic:

if lifecycle_res.as_ref().err().map_or(true, |e| !is_ws_hibernate(e)) {

claude · 2025-11-20T22:38:50Z

Medium: Race Condition Risk in Request Handler

Lines 155-158 subscribe to actor stop events AFTER reading the request body. If the actor stops between when the request arrives and when we subscribe, we might miss the stop event.

Recommendation: Move the subscription to before reading the body (before line 148) to ensure we catch all stop events.

Medium: Inconsistent Subscription Timing

In handle_request, the stopped subscription is created early (line 155), but in handle_websocket it is created later (line 290) after header extraction. While this may work due to workflow event replay, it would be more consistent and defensive to create subscriptions as early as possible in both handlers.

Minor: Comment Accuracy

Line 385 comment changed from "Send reclaimed messages" to "Send pending messages". While "pending" is clearer, verify this change is intentional and accurate for the hibernation recovery flow.

Minor: Unrelated Documentation Changes

The changes to website/public/llms-full.txt and llms.txt appear unrelated to this PR (removing TypeDoc links, adding blog post URL). These should ideally be in a separate commit or PR for clarity.

claude · 2025-11-20T22:38:58Z

Performance and Security Considerations

Performance ✅

The subscriptions are properly cleaned up when the select branches complete
Early wake check prevents unnecessary hibernation cycles
No blocking operations introduced

Security ✅

No security vulnerabilities introduced
Proper error handling prevents information leakage
Actor stop events are subscription-based (no polling)

Test Coverage

No tests found for pegboard-gateway or pegboard-runner packages. Consider adding integration tests for:

Actor stopping during HTTP request
Actor stopping during WebSocket open handshake
Actor stopping during active WebSocket connection
Hibernation behavior with pending messages

Recommendations

URGENT: Fix the inverted close frame condition (line 562-566)
Move stopped subscription before body reading in handle_request
Consider consistency in subscription timing across handlers
Split unrelated documentation changes into separate commit
Add integration tests for actor stop scenarios

Code Quality

Overall code follows repository conventions well:

✅ Proper use of structured logging with tracing
✅ Correct error types from rivet-error
✅ Appropriate use of async/await patterns
✅ Good separation of concerns

The core logic is sound, but the inverted condition is a critical bug that must be fixed before merge.

claude · 2025-11-20T22:47:56Z

Code Review

This PR adds proper handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway. The changes look solid overall, but I have a few observations and suggestions:

Positive Aspects

Good defensive programming: Adding subscriptions to the actor::Stopped event prevents requests from hanging when an actor stops unexpectedly during processing.
Consistent error handling: The code appropriately returns ServiceUnavailable for regular requests and WebSocketServiceUnavailable/WebSocketServiceHibernate for WebSocket connections based on hibernation capability.
Early wake optimization: The check for pending messages before hibernating (line 614-621 in lib.rs) is a smart optimization to avoid unnecessary hibernation cycles.
Logical cleanup in runner: The removal of WebSocket close messages on runner disconnect (pegboard-runner/src/lib.rs:249-253) makes sense—WebSockets should wait for the actor itself to stop rather than the runner connection.

Issues and Concerns

1. Logic Error in Close Message Condition ⚠️ CRITICAL

Location: engine/packages/pegboard-gateway/src/lib.rs:562-566

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Problem: This condition is inverted. Currently it evaluates as follows:

lifecycle_res is Err → true only when is_ws_hibernate(err) returns true
lifecycle_res is Ok(_) → false

This means the close message is only sent when hibernating, which is the opposite of the intended behavior based on the comment "Send close frame to runner if not hibernating".

Expected behavior: The close message should be sent in all cases EXCEPT when hibernating.

Fix: Negate the condition:

if !lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Or more clearly:

let should_send_close = lifecycle_res
    .as_ref()
    .err()
    .map_or(true, |err| !is_ws_hibernate(err));

if should_send_close {

2. Missing Error Propagation Context

Location: engine/packages/pegboard-gateway/src/lib.rs:152-159, 290-293

The new .await? calls for subscribing to actor stopped events do not include context. Following the codebase pattern:

let mut stopped_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Stopped>(("actor_id", self.actor_id))
    .await
    .context("failed to subscribe to actor stopped events")?;

3. Inconsistent Import Style

Location: engine/packages/pegboard-gateway/src/lib.rs:17

The new import breaks the established pattern:

proxy_service::{ResponseBody, is_ws_hibernate},

Following Rust conventions, function imports should maintain their module path or use a more explicit import:

use rivet_guard_core::proxy_service::{self, ResponseBody};
// Then use: proxy_service::is_ws_hibernate(err)

Or if direct import is preferred, it should be on a separate line for clarity.

4. Potential Race Condition

Location: engine/packages/pegboard-gateway/src/lib.rs:614-621

The check for pending messages happens before starting the keepalive task. There is a potential race where:

Check shows no pending messages
A message arrives
Hibernation starts
Message is missed until the next wake

Consider moving this check after setting up the subscription or handling it within the select loop.
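One general way to close a check-then-sleep window is to evaluate the predicate and start waiting under the same lock, so a message cannot slip in between the two steps. The sketch below is a std-only analogy using a Mutex and Condvar, not the gateway's async code; `PendingQueue` and `WakeReason` are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical queue of messages buffered for a hibernating WebSocket.
pub struct PendingQueue {
    msgs: Mutex<VecDeque<Vec<u8>>>,
    wake: Condvar,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WakeReason {
    PendingMessages,
    TimedOut,
}

impl PendingQueue {
    pub fn new() -> Self {
        Self { msgs: Mutex::new(VecDeque::new()), wake: Condvar::new() }
    }

    /// Enqueue under the lock, then notify any sleeper.
    pub fn push(&self, msg: Vec<u8>) {
        self.msgs.lock().unwrap().push_back(msg);
        self.wake.notify_all();
    }

    /// Sleep until a message is pending or the timeout elapses.
    /// The emptiness check and the wait share one lock, so a push that
    /// happens "between" them is impossible.
    pub fn hibernate(&self, timeout: Duration) -> WakeReason {
        let guard = self.msgs.lock().unwrap();
        let (guard, _timeout_result) = self
            .wake
            .wait_timeout_while(guard, timeout, |q| q.is_empty())
            .unwrap();
        if guard.is_empty() {
            WakeReason::TimedOut
        } else {
            WakeReason::PendingMessages
        }
    }
}
```

If messages are already queued, `hibernate` returns immediately, which matches the intent of the early wake check while also covering messages that arrive afterwards.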

5. Missing Test Coverage

The PR diff includes no test-related changes. Given the critical nature of this fix (handling actor stopped during active requests), integration tests should be added to verify:

Request handling when actor stops mid-request
WebSocket handling when actor stops during open/message/close phases
Hibernation behavior when actor stops
Proper error types returned in each scenario

Performance Considerations

The additional subscriptions (stopped_sub) are created for every request/WebSocket connection. This is appropriate given the need to detect actor stopped events, but monitor subscription overhead in high-throughput scenarios.

Security

No security concerns identified. The changes maintain existing error handling patterns and do not introduce new attack surfaces.

Minor Suggestions

Line 387 comment: "Send pending messages" is clearer than "Send reclaimed messages" ✓ (already fixed in PR)
Logging consistency: Consider using ?actor_id in the stopped event log messages for consistency with other tracing calls in the codebase.

Summary

The core concept and implementation are sound, but the inverted condition on line 562-566 is a critical bug that would cause close messages to be sent only during hibernation instead of in all other cases. Please fix this before merging.

The lack of test coverage for these critical paths is also concerning. At minimum, add tests for the actor stopped scenarios to prevent regressions.

Review generated with Claude Code following CLAUDE.md conventions

claude · 2025-11-20T23:39:48Z

Code Review

Summary

This PR fixes a critical issue where actors could be stopped during WebSocket or HTTP request handling, potentially leaving clients hanging without proper error responses. The changes introduce proper actor lifecycle monitoring and graceful handling when actors are stopped mid-request.

✅ Positive Aspects

Critical Bug Fix: The PR addresses an important edge case where actor stoppage wasn't being handled during active requests/WebSocket connections
Consistent Error Handling: All three main code paths (request handling, WebSocket open, and WebSocket message loop) now consistently check for actor stoppage
Appropriate Error Responses: Returns ServiceUnavailable for requests and either WebSocketServiceHibernate (if hibernation is enabled) or WebSocketServiceUnavailable (if not) for WebSockets
Code Cleanup: Removed unnecessary close frame sending logic in pegboard-runner that was sending hibernation messages inappropriately
Hibernation Optimization: The new has_pending_websocket_messages function allows for immediate rewake during hibernation if messages are pending, avoiding unnecessary hibernation cycles

🔍 Issues & Concerns

1. Logic Error in Close Frame Sending (`lib.rs:563-566`)

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Issue: This logic is inverted. On a Result, map_or_else takes two closures: the first handles the Err case and the second handles the Ok case. The current code:

When lifecycle_res is Err: calls is_ws_hibernate(err)
When lifecycle_res is Ok: returns false

This means the condition is true ONLY when the error IS a hibernate error, so close frames are sent only while hibernating. The condition should instead skip sending when it IS hibernating.

Expected behavior: You should send the close frame UNLESS we're hibernating. The correct logic should be:

if !lifecycle_res
    .as_ref()
    .err()
    .map(is_ws_hibernate)
    .unwrap_or(false)
{

Or more clearly:

let should_skip_close = lifecycle_res
    .as_ref()
    .err()
    .map(is_ws_hibernate)
    .unwrap_or(false);

if !should_skip_close {
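
To make the difference concrete, here is a minimal sketch that evaluates both forms of the condition for each possible result. The `WsError` enum and this version of `is_ws_hibernate` are hypothetical stand-ins, not the gateway's real types; only the hibernate versus non-hibernate distinction is modeled.

```rust
/// Hypothetical stand-in for the gateway's error type; only the
/// hibernate / non-hibernate distinction matters here.
#[derive(Debug)]
pub enum WsError {
    Hibernate,
    Unavailable,
}

/// Hypothetical stand-in for `is_ws_hibernate`.
pub fn is_ws_hibernate(err: &WsError) -> bool {
    matches!(err, WsError::Hibernate)
}

/// The condition as quoted in the review: true only for `Err(Hibernate)`.
pub fn quoted_condition(res: &Result<(), WsError>) -> bool {
    res.as_ref().map_or_else(is_ws_hibernate, |_| false)
}

/// The suggested rewrite: true unless the result is `Err(Hibernate)`.
pub fn should_send_close(res: &Result<(), WsError>) -> bool {
    !res.as_ref().err().map(is_ws_hibernate).unwrap_or(false)
}
```

With these definitions, `quoted_condition` is true only for `Err(WsError::Hibernate)`, while `should_send_close` is true for `Ok(())` and for `Err(WsError::Unavailable)`. Whether the quoted form is a bug depends on whether the guarded block sends or skips the close frame.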

2. Visibility Change May Need Documentation (`proxy_service.rs:2540`)

The function is_ws_hibernate was changed from private to pub. While this is necessary for the fix, consider:

Is this part of a public API that external consumers might use?
Should this be documented as part of the module's public interface?

3. Race Condition Potential

The subscription to actor::Stopped happens after the request body is read. While this is a very small window, there's a theoretical race where:

Actor stops
Body is read
Subscription is created (misses the stop event)
Request proceeds as if actor is still running

Mitigation: Consider subscribing before reading the request body, or document why this ordering is safe.
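
The ordering concern can be illustrated with a toy in-process bus. This is a sketch under assumptions: `Bus` is a hypothetical model in which subscribers only see events published after they subscribe, and it does not reflect the real pub/sub API used by the gateway.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Toy pub/sub bus (hypothetical): subscribers only receive events
/// that are published after they subscribed.
#[derive(Default)]
pub struct Bus {
    subscribers: Vec<Sender<&'static str>>,
}

impl Bus {
    pub fn subscribe(&mut self) -> Receiver<&'static str> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    pub fn publish(&self, event: &'static str) {
        for tx in &self.subscribers {
            // A send error only means that subscriber was dropped.
            let _ = tx.send(event);
        }
    }
}

/// Subscribe first, then do the slow step (reading the body): a stop
/// event published in between is still delivered.
pub fn subscribe_then_read(bus: &mut Bus) -> bool {
    let stopped_sub = bus.subscribe();
    bus.publish("stopped"); // actor stops while the body is being read
    stopped_sub.try_recv().is_ok()
}

/// Do the slow step first, then subscribe: the stop event is missed.
pub fn read_then_subscribe(bus: &mut Bus) -> bool {
    bus.publish("stopped"); // actor stops while the body is being read
    let stopped_sub = bus.subscribe();
    stopped_sub.try_recv().is_ok()
}
```

In this model `subscribe_then_read` observes the stop event and `read_then_subscribe` does not, which is the window the mitigation is meant to close.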

4. Inconsistent Comment (`lib.rs:385`)

// Send pending messages

The comment was changed from "Send reclaimed messages" to "Send pending messages". While "pending" is clearer, consider whether this accurately describes what resend_pending_websocket_messages does (it resends messages that were buffered during hibernation).

🎯 Performance Considerations

Multiple Subscriptions: Each request/WebSocket connection now creates a new subscription to the actor::Stopped event. This is reasonable for correctness, but monitor subscription overhead at scale.
tokio::select! Overhead: Adding another branch to the select loops adds minimal overhead, but it's worth noting for hot paths.

🔒 Security Considerations

No major security concerns. The changes properly propagate error conditions rather than leaving connections in undefined states, which is a security improvement.

📝 Test Coverage

Critical: No tests found for pegboard-gateway or pegboard-runner packages. This PR would benefit greatly from:

Unit tests for the new has_pending_websocket_messages function
Integration tests simulating:
- Actor stopping during HTTP request
- Actor stopping during WebSocket open handshake
- Actor stopping during WebSocket message exchange
- Proper hibernation behavior when actor stops

Consider adding tests in a follow-up PR if not in this one.

📚 Code Style & Conventions

✅ Follows CLAUDE.md conventions:

Uses structured logging with tracing::debug
Lowercase log messages
Proper error handling with anyhow
No formatting changes (as instructed)

🎨 Suggestions for Improvement

Extract magic numbers: u16::MAX / 2 in wrapping_gt could use a named constant
Add tracing spans: The new subscription/select blocks could benefit from spans for debugging
Consider metrics: Track how often actors are stopped mid-request for monitoring
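
As an illustration of the first suggestion, the sketch below names the threshold in a wrap-aware comparison. The body of `wrapping_gt` here is an assumption based on the common serial-number comparison idiom; the real function in the PR may be defined differently, and `WRAP_HALF_RANGE` is a hypothetical name.

```rust
/// Half of the `u16` index space. Forward distances smaller than this
/// count as "ahead"; larger ones are treated as "behind" after wrap-around.
const WRAP_HALF_RANGE: u16 = u16::MAX / 2;

/// Hypothetical wrap-aware "greater than" for `u16` sequence numbers.
/// The real `wrapping_gt` may be implemented differently.
pub fn wrapping_gt(a: u16, b: u16) -> bool {
    a != b && a.wrapping_sub(b) < WRAP_HALF_RANGE
}
```

Under this definition `wrapping_gt(2, 65534)` is true because index 2 is four steps ahead of 65534 once the counter wraps.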

🚀 Recommendation

Conditional Approval - The PR fixes an important issue and the approach is sound, but the logic error in lines 563-566 must be fixed before merging. Please:

Fix the inverted logic for close frame sending
Add a test case for actor stoppage during WebSocket handling
Consider the race condition mitigation mentioned above

Once the logic error is addressed, this will be a solid improvement to the actor lifecycle handling.

Great work identifying and fixing this edge case! The consistent error handling across all code paths is particularly well done. 🎯

claude · 2025-11-21T00:56:48Z

Code Review - PR #3481

Summary

This PR fixes a critical race condition where actors could be stopped during WebSocket or HTTP request handling, leading to requests hanging indefinitely. The solution subscribes to the actor::Stopped event and returns appropriate errors when the actor stops mid-request.

Positive Observations

Excellent Race Condition Fix - The core issue is well-identified and properly addressed
Proper Error Handling - Correctly differentiates between hibernatable and non-hibernatable WebSocket connections
Clean Separation of Concerns - The pegboard-runner changes are correct
Good Logging - Debug logs properly placed with clear messages

Critical Issues

Issue 1: Logic Error in Close Frame Condition (lines 562-566)

The condition appears inverted. The comment says send close frame if NOT hibernating, but the code will ONLY send close frames when there IS a hibernation error.

Current: if lifecycle_res.as_ref().map_or_else(is_ws_hibernate, |_| false)

This evaluates to true only when there is an error AND it is a hibernate error, which is backwards.

Issue 2: Missing Stopped Subscription During Hibernation (lines 673-705)

When a WebSocket is hibernating, the code listens for actor::Ready events but not actor::Stopped events. If the actor stops during hibernation, the WebSocket will hang indefinitely until the client disconnects.
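
One way to picture the fix is to wait on both notifications and act on whichever arrives first. The sketch below is synchronous and uses a single standard-library channel carrying a hypothetical `ActorEvent` enum; the real code is async and uses separate subscriptions, so treat the names and structure here as assumptions.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical event type merging the two actor notifications.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActorEvent {
    Ready,
    Stopped,
}

/// Outcome of waiting while a WebSocket is hibernating.
#[derive(Debug, PartialEq, Eq)]
pub enum HibernationOutcome {
    Rewake,
    ActorStopped,
    TimedOut,
}

/// Waits for whichever event arrives first instead of only `Ready`.
pub fn wait_during_hibernation(
    events: &Receiver<ActorEvent>,
    timeout: Duration,
) -> HibernationOutcome {
    match events.recv_timeout(timeout) {
        Ok(ActorEvent::Ready) => HibernationOutcome::Rewake,
        Ok(ActorEvent::Stopped) => HibernationOutcome::ActorStopped,
        // Timeout and a dropped sender are folded together for brevity.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            HibernationOutcome::TimedOut
        }
    }
}
```

Handling `Stopped` as its own outcome lets the caller return an error promptly rather than waiting for the client to disconnect.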

Recommendations

Fix the inverted close frame condition logic (critical)
Add stopped_sub handling during hibernation (critical)
Consider moving subscription creation earlier to avoid race conditions
Add test coverage for actor stop scenarios
Separate unrelated website documentation changes

Overall Rating

Needs Revision - The core fix is excellent, but issues 1 and 2 are critical bugs that could cause incorrect behavior in production.

claude · 2025-11-21T01:35:32Z

Code Review

This PR effectively handles the case where an actor stops during WebSocket or HTTP request processing. The implementation is solid overall, with good error handling and proper subscription management. Here are my findings:

✅ Strengths

Proper event subscription: The PR correctly subscribes to pegboard::workflows::actor::Stopped events in all three critical locations (HTTP requests, WebSocket open, and WebSocket message loops).
Consistent error handling: Returns appropriate errors (ServiceUnavailable for HTTP, WebSocketServiceUnavailable or WebSocketServiceHibernate for WebSockets depending on hibernation support).
Clean separation of concerns: The runner-side changes properly distinguish between ephemeral HTTP requests and persistent WebSocket connections.
Hibernation optimization: The new has_pending_websocket_messages check in handle_websocket_hibernation is a smart optimization to immediately rewake if messages are queued.
Fixed close frame logic: The conditional sending of close frames to runners (only when not hibernating) prevents unnecessary messages and properly uses the is_ws_hibernate helper.

🔍 Potential Issues

Logic inversion in close frame condition (lib.rs:563-566)
```
if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{
```
At first glance this logic appears inverted, so it is worth walking through. map_or_else(is_ws_hibernate, |_| false) means: if Err, call is_ws_hibernate(err); if Ok, return false. The intended outcome for each case is:
- Result is Ok (any variant) → returns false → sends close frame ✅
- Result is Err and NOT hibernate → returns false → sends close frame ✅
- Result is Err and IS hibernate → returns true → skips close frame ✅
The intent is "send close frame if it's NOT a hibernate error", but the condition should read more clearly. Consider:
```
let should_send_close = !lifecycle_res
    .as_ref()
    .is_err_and(is_ws_hibernate);

if should_send_close {
    // send close frame
}
```
Subscription cleanup (lib.rs:155-158, 290-293, 436)
The stopped_sub subscriptions are created but never explicitly cleaned up. While Rust's drop semantics handle this, there are three subscriptions created per request/WebSocket:
- One in handle_request (line 155)
- One in handle_websocket (line 290)
- One used in the spawned task (line 436)
The spawned task subscription is moved into the task and shares the same lifetime, which is correct. However, consider documenting that subscriptions are automatically cleaned up on drop to make the resource management explicit.
Removed unused import (pegboard-runner/lib.rs:15)
Good cleanup removing CloseCode import, but verify that all error paths that previously used it now work correctly.
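
The subscription cleanup point above relies on Rust's drop semantics. A minimal sketch of that behavior, assuming a hypothetical `Subscription` handle that registers on creation and unregisters on drop (the real subscription type is not shown in this review):

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical subscription handle. It increments a shared counter on
/// creation and decrements it when dropped.
pub struct Subscription {
    active: Rc<Cell<usize>>,
}

impl Subscription {
    pub fn new(active: &Rc<Cell<usize>>) -> Self {
        active.set(active.get() + 1);
        Subscription { active: Rc::clone(active) }
    }
}

impl Drop for Subscription {
    fn drop(&mut self) {
        // Runs automatically when the handle goes out of scope.
        self.active.set(self.active.get() - 1);
    }
}

/// Returns the active-subscription count seen inside and after the scope.
pub fn scoped_subscription_counts() -> (usize, usize) {
    let active = Rc::new(Cell::new(0));
    let during = {
        let _stopped_sub = Subscription::new(&active);
        active.get()
    };
    (during, active.get())
}
```

Here the count is 1 while `_stopped_sub` is alive and 0 once its scope ends, with no explicit unsubscribe call.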

🎯 Best Practices

Logging: All the new error paths have appropriate debug-level logs ("actor stopped while waiting for..."). Excellent for debugging. ✅
Error messages: Clear and descriptive error messages. ✅
Following CLAUDE.md conventions:
- Uses structured logging with tracing correctly (?x for debug formatting) ✅
- Lowercase log messages ✅
- Proper error handling with Result return types ✅

🚀 Performance Considerations

Subscription overhead: Creating three UPS subscriptions per connection adds some overhead, but this is necessary for correctness. The subscriptions are lightweight and properly scoped.
Early wake optimization: The has_pending_websocket_messages check is a good optimization to avoid unnecessary hibernation cycles.

🔒 Security

No security concerns identified. The changes maintain proper error boundaries and don't expose internal state.

🧪 Test Coverage

Concern: No tests were added for this new functionality. Consider adding integration tests that:

Simulate an actor stopping during an HTTP request
Simulate an actor stopping during WebSocket open
Simulate an actor stopping during active WebSocket communication
Verify hibernating WebSockets properly handle actor stops

📝 Minor Suggestions

Consider using is_err_and for the close frame condition to improve readability
Add a comment explaining why WebSockets don't receive close messages when the runner disconnects (as noted in pegboard-runner/lib.rs:249-252)
The website/public/llms*.txt changes appear unrelated to this PR - consider removing them

Summary

Recommendation: Approve with minor suggestions

The core functionality is correct and handles the critical race condition properly. The main concern is test coverage, and there's a minor readability improvement opportunity in the close frame condition. The changes follow the codebase conventions well and integrate cleanly with the existing error handling patterns.

graphite-app · 2025-11-21T02:40:38Z

Merge activity

Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

MasterPtato requested a review from NathanFlurry November 18, 2025 23:43

MasterPtato mentioned this pull request Nov 18, 2025

chore: rename retry -> hibernate in runner protocol #3473

Closed

vercel bot had a problem deploying to Preview – rivetkit-serverless November 18, 2025 23:43 Failure

This was referenced Nov 18, 2025

fix(gas): fix batch listen, fix history for graceful signal send in workflows #3479

Draft

fix(pb): rewrite runner wf to handle batch signals #3480

Draft

MasterPtato force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from 6c5e080 to dc9084a Compare November 18, 2025 23:50

MasterPtato force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from e83ac9b to 00cd5f2 Compare November 18, 2025 23:50

vercel bot temporarily deployed to Preview – rivetkit-serverless November 18, 2025 23:50 Inactive

This was referenced Nov 18, 2025

fix(gas): fix loop forgotten bug due to concurrency #3482

Draft

fix(pb): simplify runner wf #3483

Draft

This was referenced Nov 19, 2025

chore(rivetkit): implement new hibernating ws protocol #3464

Closed

fix(actor-kv): fix listing keys #3484

Closed

vercel bot temporarily deployed to Preview – rivetkit-serverless November 19, 2025 01:28 Inactive

vercel bot had a problem deploying to Preview – rivet-site November 19, 2025 01:29 Failure

This was referenced Nov 19, 2025

chore(engine-runner): remove echo close on ToClientWebSocketClose #3487

Closed

chore(pegboard-gateway): remove TunnelAck #3488

Closed

jog1t mentioned this pull request Nov 19, 2025

refactor: performance improvements for inspector #3452

Draft

NathanFlurry force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from d34ad3e to 44b3d81 Compare November 20, 2025 22:32

NathanFlurry mentioned this pull request Nov 20, 2025

feat(cloudflare-workers): add createInlineClient #3466

Closed

MasterPtato force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from 44b3d81 to d34ad3e Compare November 20, 2025 22:38

NathanFlurry force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from 274d094 to d44f1d1 Compare November 20, 2025 23:32

NathanFlurry force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from d34ad3e to 44b3d81 Compare November 20, 2025 23:32

NathanFlurry mentioned this pull request Nov 20, 2025

chore(examples): add counter-next-js example #3506

Closed

NathanFlurry mentioned this pull request Nov 21, 2025

fix(engine): ensure first keepalive tick is not skipped to prevent timeout on ws hibernation #3507

Closed

MasterPtato force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from 44b3d81 to d34ad3e Compare November 21, 2025 00:48

MasterPtato mentioned this pull request Nov 21, 2025

chore: remove msg id parts from protocol, move id utils to runner protocol crate #3508

Closed

NathanFlurry force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from 274d094 to d44f1d1 Compare November 21, 2025 01:27

NathanFlurry force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from d34ad3e to 44b3d81 Compare November 21, 2025 01:27

This was referenced Nov 21, 2025

chore(rivetkit): buffer outbound tunnel messages #3509

Closed

docs: runner lifecycle #3510

Closed

chore(engine): upgrade vbare 0.0.4 #3511

Closed

This was referenced Nov 21, 2025

fix(runner-protocol): add noop converters for v1 -> v3 #3512

Closed

fix(runner-protocol): fix decoding gateway & req id from req id instead of message id #3513

Closed

chore(pegboard-runner): log runner messages #3514

Closed

NathanFlurry approved these changes Nov 21, 2025

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025

fix(guard): handle actor stopped during ws/req (#3481)

3891bf8

graphite-app bot closed this Nov 21, 2025

graphite-app bot deleted the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch November 21, 2025 02:43
